home *** CD-ROM | disk | FTP | other *** search
- >From postnews Tue Mar 18 18:05:22 1986
- Subject: More Pep for Boyer-Moore Grep (part 2 of 2)
- Newsgroups: net.unix
-
- # "Gratiano speaks an infinite deal of nothing, more than any man in all
- of Venice. His reasons are as two grains of wheat hid in two bushels of
- chaff: you shall seek all day ere you find them, they are not worth
- the search." -- Shakespeare, Merchant of Venice
-
- ... or, part 2, "Reach out and Boyer-Moore Egrep Someone"
-
- Maybe you never use 'grep'. Then ignore this. But if you do, why not
- use the best algorithm? Serious addicts know that for unstructured yet
- stable text, B-trees are used for speed, or something like Lesk's nifty
- (and unavailable) 'grab' suite for inverted files are ways to go. Barring file
- inversion daemons for netnews and other ephemera, we are limited to the
- present improvements.
-
- Proper skeptics should question why a nearly I/O-bound program (but
- not for any CPU with less than the power of several VAX MIPS, alas) should
- be made more so. The question was posed in B & M's classic 1978 CACM
- paper -- the answer then was to free up more CPU cycles for timesharing.
- Now, our motivations are more mundane (we won't have desktop 5 MIP machines
- for another year), but not only that, we've discovered that the Cray 2's
- standard 'egrep' is also very anemic, performing 8-12 times as worse as ours
- on simple patterns. For shame, especially since hearing of the rumor that
- certain group theorists have a search application ready for testing.
- Boyer-Moore could fill in until a Cray vectorizing C compiler shows up.
- Sheer speed for machines whose filesystems are cached in memory is nice too.
-
- A quick-and-dirty rundown of the debts to which the new hybrid pays
- now follows.
-
- Thompson, K. T. (CACM, November 1968):
- Regular Expression Search Algorithm. As usual, obvious
- once you understand it. The current 'egrep'. Still
- useful as a base. Abstracted by Aho/Ullman as Algorithm
- 9.1 in Design and Analysis of Computer Algorithms.
-
- Boyer/Moore:
- Not quite pre-Unix. Oh well. Modern designers should
- know better now, if they want their stuff to get out there.
- By the way, I haven't used delta2 (or 1) since the O(mn) case
- case doesn't come up too often. Sure Knuth stood on his head
- to better the linearity, but his proof had a bug in it until
- the 1980 SIAM J. Comput. retraction. Would you want to code
- something that even Knuth trips up on?
-
- Now to assuage nagging feelings that geneticists might want
- to search entire libraries of 9000-unit nucleotide protein
- sequences for ((AGCA|TTGCA).*TGC)|AGCT)?T?A+ or some nonsense
- which MIGHT be nonlinear, you would want delta2. So convince
- someone to do the Galil/Apostolico/Giancarlo 2n comparison
- worst case stuff. See egrep.c for reference.
-
- Gosper, W. (HAKMEM 1972):
- Gosper didn't get around to the Thompson-like machine until
- 1972 with HAKMEM. His PDP 10 code is nevertheless valiant.
- He is also (barely) credited with conceiving the backwards
- match idea independently. Where is he now?
-
- Morris/Pratt:
- Nice guys, but for this purpose, has-beens.
- Neat to see a hacker's triumph bury some theory.
-
- Horspool (Software Practice & Experience, 1980):
- Now here's a Canadian after the heart of things
- (perfect hashing, text compression, NP-complete
- code generation probs., etc.) Did some Amdahl
- timings to show that delta2 is not so hot.
- Knows about Search For Least Frequent Character First,
- which is useful for short patterns.
-
- {,e,f}grep man page:
- The laughable bugnote "but we do not know a single algorithm
- that spans a wide enough range of space-time tradeoffs"
- certainly presumes that there is no such thing as switching
- logic. How the 'grep' family got into a multiple-version
- mess is probably a Guy Harris story; 'egrep' looks like the
- winner, as its functionality is pretty much a superset of
- the other two. The K & P teaser (p. 105) offers hope for
- unification, but we see no difference with extant V8 code.
-
- "Not cited in the text" -- the sexy randomized Karp/Rabin string searcher
- (Sedgewick, Algorithms, or Karp's Turing Award Lecture), and the ribald classic
- Time Warps, String Edits, and Macromolecules -- The Theory and Practice
- of Sequence Comparison (Kruskal & Sankoff). Inquire within.
- Thanks for your patience,
-
- James A. Woods (ames!jaw)
- NASA Ames Research Center
-
- P.S.
- Current applications for Boyer-Moore code include modification of
- 'fastfind' for true speed, as well as substring search for 'grab', both
- benefiting from BM-style search thru incrementally-compressed files/indices.
-
-